chore(flushing): standardize code with refactoring on some flushers and retries#1018
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the trace flushing code to standardize and simplify HTTP client management. The main goal is to create a single, reusable HTTP client instance that can be shared across multiple flush operations, improving performance through connection pooling and TLS session reuse.
Changes:
- Removed trait-based abstractions for
TraceFlusherandStatsFlusherin favor of concrete types - Extracted HTTP client creation logic into a shared
hyper_clientmodule - Added lazy initialization of HTTP clients using
OnceCellfor caching and reuse
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| bottlecap/src/traces/trace_flusher.rs | Removed TraceFlusher trait and ServerlessTraceFlusher implementation; added cached HTTP client with OnceCell; moved HTTP client creation to separate module |
| bottlecap/src/traces/stats_flusher.rs | Removed StatsFlusher trait and ServerlessStatsFlusher implementation; added cached HTTP client with OnceCell; updated to use shared hyper_client module |
| bottlecap/src/traces/mod.rs | Added new hyper_client module to public exports |
| bottlecap/src/traces/hyper_client.rs | New module containing shared HTTP client creation logic and type definitions |
| bottlecap/src/flushing/service.rs | Removed generic type parameters from FlushingService to use concrete flusher types |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let client_clone = http_client.clone(); | ||
| batch_tasks.spawn(async move { | ||
| Self::send(traces_clone, Some(&endpoint), &proxy_https, &tls_cert_file).await | ||
| Self::send_traces(traces_clone, Some(endpoint), client_clone).await |
There was a problem hiding this comment.
Passing Some(endpoint) here while passing None on line 121 creates an inconsistency. The endpoint variable is already an Endpoint from the loop, but send_traces expects Option<Endpoint>. Consider restructuring send_traces to accept &Endpoint directly and have a separate internal method or branch for the default endpoint case to make the API clearer.
There was a problem hiding this comment.
Seems like a pre-existing bug
There was a problem hiding this comment.
To be addressed on another PR
|
@lym953 might need a review from you, I'm adding retries on stats:-) amongst other things |
| } | ||
|
|
||
| impl ServerlessTraceFlusher { | ||
| pub fn get_http_client( |
There was a problem hiding this comment.
this created a client on every call
we were creating a client every time when flushing traces, now we just use one, also removes unnecessary traits as we are not creating more tracing agents for other use cases
030fd5d to
7590bd2
Compare
lym953
left a comment
There was a problem hiding this comment.
Thanks for the refactoring and for adding comments!
|
Is the a limit on the number of retries? |
I think for now it's 2, not the standard on other services |
## Overview Continuation of #1018 removing unnecessary mut lock on callers for dogstatsd
## Overview Continuation of #1018 removing unnecessary mut lock on callers for dogstatsd
… Lambda ## Problem After upgrading from extension v92 to v93, customers reported a sharp increase in "Max retries exceeded, returning request error" errors (SVLS-8672, GitHub issue #1092). ## Root Cause PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model: 1. Lambda executes, HTTP client created with connection pool 2. Extension flushes traces, connections remain open in pool 3. Lambda freezes (paused between invocations - seconds to minutes) 4. Lambda resumes, cached client reuses stale connections 5. TCP errors → "Max retries exceeded" In v92, a new HTTP client was created per-flush, so there were never stale connections to reuse. ## Solution Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale connection issues while still benefiting from client caching. This matches the pattern used in libdatadog's `new_client_periodic()` which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent." Fixes: SVLS-8672 Fixes: #1092 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… Lambda (#1094) ## Summary Fixes a regression introduced in v93 where customers see a sharp increase in "Max retries exceeded, returning request error" errors after upgrading from v92. - Disables HTTP connection pooling for the trace/stats flusher by setting `pool_max_idle_per_host(0)` - Prevents stale connections from being reused after Lambda freeze/resume cycles ## Problem PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model: 1. Lambda executes, HTTP client created with connection pool 2. Extension flushes traces, connections remain open in pool 3. Lambda **freezes** (paused between invocations - can be seconds to minutes) 4. Lambda **resumes**, cached client reuses stale connections 5. TCP errors → "Max retries exceeded" In v92, a new HTTP client was created per-flush, so there were never stale connections to reuse. ## Solution Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale connection issues while still benefiting from client caching (TLS session reuse, configuration reuse, etc.). This matches the pattern used in libdatadog's `new_client_periodic()` which explicitly disables pooling with the comment: > "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent." ## Related - Fixes [SVLS-8672](https://datadoghq.atlassian.net/browse/SVLS-8672) - Fixes #1092 - Regression introduced in #1018 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: jordan gonzález <30836115+duncanista@users.noreply.github.com>
Overview
Simplify code for flushing, trying to standardize everything by avoiding code all over the place, ensuring that we only create one client and we can reuse as much as possible for performance improvements
Motivation
SVLS-8507